DOCUMENT RESUME



ED 338 629 



TM 013 058 



AUTHOR          Marsh, Herbert W.
TITLE           Students' Evaluations of University Teaching:
                Research Findings, Methodological Issues, and
                Directions for Future Research.
PUB DATE        87
NOTE            30p.
PUB TYPE        Reports - Evaluative/Feasibility (142)

EDRS PRICE      MF01/PC02 Plus Postage.
DESCRIPTORS     *College Students; *Construct Validity; Factor
                Analysis; Feedback; Foreign Countries; Higher
                Education; *Professors; *Research Methodology;
                *Student Evaluation of Teacher Performance; *Teacher
                Effectiveness; Test Construction; Test Reliability;
                Test Validity
IDENTIFIERS     Australia; Multidimensional Models; Papua New Guinea;
                Spain; *Students Evaluation of Educational Quality



ABSTRACT 

H. W. Marsh's monograph (1987) on students'
evaluations of teaching effectiveness in higher education is
summarized. The research, which emphasized the construct validity
approach, led to the development of the Students' Evaluations of
Educational Quality (SEEQ) instrument. Factor analysis resulted in
identification of nine SEEQ factors: learning value, instructor
enthusiasm, organization, individual rapport, group interaction,
breadth of coverage, examinations and grading, assignments and
readings, and workload difficulty. The analysis encompassed 5,000
classes conducted for five groups of courses selected to represent
diverse academic disciplines at the graduate and undergraduate
levels. Instructors evaluated their own teaching effectiveness on the
same SEEQ form as that completed by their students. Tertiary students
from different countries evaluated teaching effectiveness with the
SEEQ. Use of the instrument indicates that class-average student
ratings are: (1) multidimensional; (2) reliable and stable; (3)
primarily a function of the instructor rather than of course content;
(4) relatively valid against a variety of indicators of effective
teaching; (5) relatively unaffected by a variety of variables
hypothesized as potential biases; and (6) perceived as useful
feedback by faculty about their teaching, by students for use in
course selection, and by administrators for use in personnel
decisions. Eight data tables and one flowchart are presented.
(TJH) 
ABSTRACT

The purpose of this presentation is to summarize my research
on students' evaluations of teaching effectiveness in higher
education. The research led to the development of the
Students' Evaluations of Educational Quality (SEEQ)
instrument. These findings indicate that class-average
student ratings are:

Students' Evaluations 2



Chapter 1. Introduction

Historical Perspective

Students have evaluated teachers for as long as there have
been individuals claiming to be teachers. Programs of formal
collection of students' evaluations were introduced in the
United States in the 1920s. H. H. Remmers initiated an
extensive research program in the 1920s that spanned 3
decades and anticipated many of the issues presently
considered. The topic has been one of the most frequently
studied and controversial in American educational research.

Purposes for Collecting Students' Evaluations

Students' evaluations of teaching effectiveness are commonly 
collected at most North American universities. Appropriate 
purposes of these evaluations are to provide:

1) diagnostic FEEDBACK to faculty about the effectiveness
of their teaching;

2) a measure of teaching effectiveness to be used in
PERSONNEL DECISIONS;

3) information for students to use in INSTRUCTOR/COURSE
SELECTION;

4) an outcome or a process description for RESEARCH ON
TEACHING.



It will be argued here that students' evaluations as
typically defined are not appropriate for the evaluation of
courses, as opposed to the instructors who teach the
courses.



Construct Validity Approach




My research emphasizes a construct validity approach to the 
study of students' evaluations of teaching and several 
perspectives that underlie this approach: 

** effective teaching and students' evaluations designed
to reflect it are multidimensional/multifaceted;

** there is no single criterion of effective teaching; and

** tentative interpretations of relations with validity
criteria and potential biases must be scrutinized in
different contexts and must examine multiple criteria of
effective teaching.




Chapter 2. Dimensionality of Students' Evaluations

Student ratings and the teaching that they represent are
MULTIDIMENSIONAL (e.g., a teacher may be quite well organized
but lack enthusiasm).



Information from students' evaluations depends upon the
content of the items. Poorly worded or inappropriate items
will not provide useful information. If a survey instrument
contains an ill-defined hodge-podge of different items and
student ratings are summarized by an average of these items,
then there is no basis for knowing what is being measured.



Surveys should contain separate groups of related items which
are:

1) supported by empirical procedures such as factor
analysis;

2) derived from a logical analysis of the content of
effective teaching and the purposes which the ratings are to
serve, or a carefully constructed theory.



Factor Analysis. 

Empirical techniques such as factor analysis provide a test 
of whether: 

1) students differentiate among different components of
effective teaching;

2) the empirical factors match the ones the instrument
was designed to measure;

3) there is a large halo effect (a generalization from
some subjective feeling, an external influence, or an
idiosyncratic response mode) that affects responses to all
the items.

***Factor analysis cannot determine whether the obtained
factors are important to understanding effective teaching.
This requires a logical analysis of the content of the
factors.
Logical Analysis. In the development of SEEQ:

1) a large item pool was obtained from a literature
review, forms in current usage, and interviews with faculty
and students about what they see as effective teaching;

2) students and faculty were asked to rate the
importance of items;

3) faculty were asked to judge the potential usefulness
of the items as a basis for feedback;

4) open-ended student comments were examined to determine
if important aspects had been excluded.

***These criteria, along with psychometric properties, were
used to select items and revise subsequent versions. This
systematic development constitutes evidence for the content
validity of SEEQ and makes it unlikely that it contains any
trivial factors.

The SEEQ Factors (and an example item): 

Learning/Value: You have found the course intellectually
challenging and stimulating;

Instructor Enthusiasm: Instructor was dynamic and energetic
in conducting the course;

Organization: Course materials were well prepared and
carefully explained;

Individual Rapport: Instructor was friendly towards
individual students;

Group Interaction: Students were encouraged to participate
in class discussions;

Breadth of Coverage: Instructor presented background or
origin of ideas/concepts developed in class;

Examinations/Grading: Feedback on examinations/graded
materials was valuable;

Assignments/Readings: Readings, homework, etc. contributed
to appreciation and understanding of subject;

Workload/Difficulty: Course difficulty relative to other
classes was (very easy ... medium ... very hard).



Factor Analytic Results

Factor analyses identify the factors which SEEQ was designed
to measure, and demonstrate that the students' evaluations
do measure distinct components of teaching effectiveness.



1) factor analyses of evaluations from 5,000 classes were 
conducted for 5 groups of courses selected to represent 
diverse academic disciplines at graduate and undergraduate 
levels; each clearly identified the SEEQ factors. 



2) Instructors were asked to evaluate their own teaching 
effectiveness on the same SEEQ form as completed by their 
students. Factor analyses of student ratings and instructor 
self-evaluations each identified the same SEEQ factors. 



3) Tertiary students in different countries (Australia:
University of Sydney; Australia: TAFE; Papua New Guinea;
Spain) evaluated teaching effectiveness with SEEQ. Similar 
factors were identified for each of the four groups. The 
items judged to be most important were also similar in these 
very different educational settings. 



***The SEEQ results provide clear support for the
multidimensionality of students' evaluations. Students'
evaluations cannot be adequately interpreted if this
multidimensionality is ignored.
Table 1

Nineteen Instructional Rating Dimensions Adapted From Feldman (1976)

1) Teacher's stimulation of interest in the course and subject matter.

2) Teacher's enthusiasm for subject or for teaching.

3) Teacher's knowledge of the subject.

4) Teacher's intellectual expansiveness and breadth of coverage.

5) Teacher's preparation and organization of the course.

6) Clarity and understandableness of presentations and explanations.

7) Teacher's elocutionary skills.

8) Teacher's sensitivity to, and concern with, class level and progress.

9) Clarity of course objectives and requirements.

10) Nature and value of the course material including its usefulness and
relevance.

11) Nature and usefulness of supplementary materials and teaching aids. 

12) Difficulty and workload of the course. 

13) Teacher's fairness and impartiality of evaluation of students; quality
of exams.

14) Classroom management. 

15) Nature, quality and frequency of feedback from teacher to students. 

16) Teacher's encouragement of questions and discussion, and openness to the 
opinions of others. 

17) Intellectual challenge and encouragement of independent thought. 
18) Teacher's concern and respect for students; friendliness of the
teacher.

19) Teacher's availability and helpfulness.

Note. These nineteen categories were originally presented by Feldman
(1976), but in subsequent studies (e.g., Feldman, 1984) "Perceived
Outcome or Impact of Instruction" and "Personal Characteristics
('Personality')" were added, while rating dimensions 12 and 14 presented
above were not included.
Table 2

Factor Analyses of Students' Evaluations of Teaching Effectiveness (S)
and the Corresponding Faculty Self-Evaluations of Their Own Teaching (F)
in 329 Courses (Reprinted with permission from Marsh, 1984b).
Reliability

The reliability of the class-average response depends upon
the number of students rating the class. The reliability of
SEEQ factors is about:

1) .95 for 50 students/class 

2) .90 for 25 students/class 

3) .74 for 10 students/class 

4) .60 for 5 students/class 

5) .23 for 1 student/class

***Given a sufficient number of students, the reliability
of students' evaluations compares favorably with that of the
best objective tests.
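
The way reliability grows with class size can be illustrated with the Spearman-Brown formula. The sketch below is only an illustration: it assumes (the text does not say so) that the listed figures behave like Spearman-Brown projections from the single-student value of .23, and the function name is hypothetical. Under that assumption the projected values come out close to, though not identical with, the figures listed above.

```python
def spearman_brown(single_rater_reliability: float, n_raters: int) -> float:
    """Reliability of the mean of n_raters ratings (Spearman-Brown formula)."""
    if n_raters < 1:
        raise ValueError("n_raters must be at least 1")
    r = single_rater_reliability
    return n_raters * r / (1 + (n_raters - 1) * r)


# Assumed starting point: the single-student reliability of .23 quoted above.
class_sizes = [1, 5, 10, 25, 50]
projected = {n: round(spearman_brown(0.23, n), 2) for n in class_sizes}

for n, reliability in projected.items():
    print(f"{n:>2} students/class: projected reliability {reliability:.2f}")
```

The projection makes the practical point of the list concrete: ratings from a handful of students are far less dependable than a class average based on 25 or more.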



Long-Term Stability

Some critics suggest that students cannot recognize effective 
teaching until after being called upon to apply course 
materials in further coursework or after graduation. 
According to this argument, former students who evaluate 
courses with the added perspective of time will differ 
systematically from students who have just completed a course 
when evaluating teaching effectiveness. However, cross- 
sectional studies have shown good correlational agreement 
between the retrospective ratings of former students and 
those of currently enrolled students. 

In a longitudinal study the same students evaluated 
classes at the end of the course and again several years 
later, at least one year after graduation. End-of-class 
ratings in 100 courses correlated .83 with the retrospective 
ratings (a correlation approaching the reliability of the 
ratings), and the median rating at each time was nearly the 
same. 
Generalizability: Teacher and Course Effects

Researchers have also asked how highly correlated student
ratings are in two different courses taught by the same
instructor, and in the same course taught by different
instructors. This research is designed to address two
related questions.

1) What is the generality of the construct of effective
teaching as measured by students' evaluations of teaching?

2) What is the relative importance of the effect of the
instructor who teaches a class on students' evaluations, 
compared to the effect of the particular class being taught? 
(If the impact of the particular course is large, then the 
practice of comparing ratings of different instructors for 
tenure/promotion decisions may be dubious). 

In order to answer these questions I arranged ratings of
1364 courses into sets such that each set contained ratings
of:

1) the SAME INSTRUCTOR teaching the SAME COURSE on two
occasions (the correlation was .72 for Overall Instructor
Rating);

2) the SAME INSTRUCTOR teaching two DIFFERENT COURSES
(the correlation was .61);

3) the SAME COURSE taught by a DIFFERENT INSTRUCTOR (the
correlation was -.05).



***A more detailed analysis of these results shows that
student ratings primarily reflect the effectiveness of the
instructor rather than the influence of the course.




Table 3

Long-Term Stability of Student Evaluations: Relative and Absolute
Agreement Between End-of-Term and Retrospective Ratings
Table 4

Correlations Among Different Sets of Classes for Student Ratings and
Background Characteristics (Reprinted with permission from Marsh, 1984b).
Student ratings, which constitute one measure of
teaching effectiveness, are difficult to validate since
there is no single criterion of effective teaching.

A construct validation approach requires student ratings to 
be: 



1) substantially correlated with a variety of other
indicators of effective teaching; and

2) less correlated with other variables that are not
logically related to effective teaching (e.g., potential
biases).



Other possible criteria of effective teaching would include:

1) student learning (the most widely accepted);

2) instructor self-evaluations (so long as ratings are
not the basis of personnel decisions);

3) evaluations by peers and/or administrators who actually
attend class sessions;

4) the frequency of occurrence of specific behaviors
observed by trained observers;

5) evaluations of former students at time of graduation or
several years later.
Multisection Validity Studies.

Multisection courses are large courses in which
students are divided into separate groups (sections) that
are independently taught by different instructors according
to the same course outline and with the same final
examination. The critical question is whether those
instructors who receive the best evaluations are the ones
whose students perform best on the final examination.

In the ideal multisection validity study: 

1) there are many sections; 

2) students are randomly assigned to sections or at least
enroll without any knowledge about the sections or who will
teach them;

3) there are good pretest measures;

4) each section is taught completely by a separate
instructor;

5) each section has the same course outline, textbooks,
course objectives, and final examination;

6) the final examination is constructed to reflect the 
common objectives by some person who does not actually teach 
any of the sections, and, if there is a subjective 
component, is graded by an external person. 

Cohen (1981) conducted a meta-analysis of all known
multisection validity studies of students' evaluations.
Across 68 multisection courses, student achievement was
consistently correlated with student ratings of Skill
(0.50), Overall Course (0.47), Structure (0.47), Student
Progress (0.47), and Overall Instructor (0.43). Only
ratings of Difficulty had a near-zero or a negative
correlation with achievement.



Cohen's meta-analysis demonstrates that sections for which
instructors are evaluated more highly by students tend to do
better on standardized examinations. This finding supports
the validity of the ratings.
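
The validity coefficient from a single multisection study is simply the correlation, across sections, between the section-average rating and the section-average score on the common final examination. A minimal sketch follows. The six data points are hypothetical and come from no study cited here, and `pearson_r` is an illustrative helper, not part of SEEQ.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant list")
    return sxy / sqrt(sxx * syy)


# Hypothetical data: one entry per section of the same course.
section_mean_rating = [3.1, 3.4, 3.8, 4.0, 4.3, 4.6]        # instructor rating
section_mean_exam = [61.0, 66.0, 64.0, 70.0, 69.0, 75.0]    # common final exam

validity_r = pearson_r(section_mean_rating, section_mean_exam)
print(f"section-level validity coefficient: {validity_r:.2f}")
```

The section, not the individual student, is the unit of analysis here, which is why the ideal design calls for many sections.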
Instructor Self-Evaluations.

Instructors' self-evaluations are a good criterion of
teaching effectiveness for validating student ratings
because:

1) They can be collected in all classes where student
ratings are collected;

2) They are likely to be widely accepted as one indicator
of effective teaching (so long as personnel decisions are
not tied to the responses);

3) Instructors can be asked to evaluate themselves with
the same SEEQ instrument used by their students, thereby
testing the validity of SEEQ.

In two studies a large number of instructors evaluated their
own teaching on essentially the same SEEQ survey which was
completed by their students. In both studies:



1) separate factor analyses of teacher and student 
responses identified the same evaluation factors; 

2) student-teacher agreement on every dimension was
significant (median rs of 0.49 and 0.49);

3) mean differences between student and faculty responses
were small (i.e., student ratings were not systematically
higher or lower than faculty self-evaluations);

4) student/teacher agreement on matching factors (e.g.,
student ratings of Learning/Value and instructor
self-ratings of Learning/Value) was high (median rs of 0.49
and 0.49);

5) student/teacher agreement on nonmatching factors
(e.g., student ratings of Organization and instructor
self-ratings of Group Interaction) was low (as it should be).

***This argues that student-instructor agreement is specific
to each factor and cannot be explained in terms of a
generalized agreement.

These two studies have important implications:

1) The good student/teacher agreement provides strong
support for the validity of student ratings;

2) The specificity of student/teacher agreement to each
rating factor supports the multidimensionality of effective
teaching.
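
The contrast between matching and nonmatching agreement can be sketched in a few lines of Python. The 3 x 3 block of correlations below is hypothetical and chosen only for illustration (rows are student rating factors, columns are instructor self-evaluation factors). The helper name `split_validities` is likewise an assumption, not a procedure described in this presentation.

```python
from statistics import median

FACTORS = ["Learning/Value", "Organization", "Group Interaction"]

# Hypothetical correlations: rows = student ratings, columns = self-evaluations.
student_by_instructor = [
    [0.46, 0.05, 0.08],
    [0.10, 0.30, -0.03],
    [0.12, -0.10, 0.62],
]


def split_validities(block):
    """Separate matching-factor (diagonal) from nonmatching (off-diagonal) correlations."""
    size = len(block)
    if any(len(row) != size for row in block):
        raise ValueError("expected a square block of correlations")
    convergent = [block[i][i] for i in range(size)]
    discriminant = [block[i][j] for i in range(size) for j in range(size) if i != j]
    return convergent, discriminant


convergent, discriminant = split_validities(student_by_instructor)
median_convergent = median(convergent)
median_discriminant = median(discriminant)

print(f"median agreement on matching factors:    {median_convergent:.2f}")
print(f"median agreement on nonmatching factors: {median_discriminant:.3f}")
```

Agreement that is high on the diagonal and near zero elsewhere is the pattern that supports both validity and multidimensionality. Uniformly high values everywhere would instead point to a generalized agreement.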
Ratings by Peers.

Peer ratings, based upon actual classroom visitation,
are often proposed as indicators of effective teaching. In
studies where peer ratings are NOT based upon classroom
visitation, ratings by peers agree with student ratings.
However, it is likely that peer ratings are based upon
information from students.



Peer ratings that are based upon classroom visitation do 
not appear to be substantially correlated with student 
ratings, any other indicator of effective teaching, or even 
the impressions of other peers. These findings suggest peer 

evaluations should NOT be used for personnel decisions. 

Murray (1980, p. 45), in comparing student ratings and peer
ratings, found peer ratings to be:

(1) less sensitive, reliable, and valid;

(2) more threatening and disruptive of faculty morale; and 

(3) more affected by non-instructional factors than 
student ratings. 

Summary and Implications of Validity Research

Student ratings are significantly and consistently related
to a number of varied criteria including the ratings of
former students, student achievement in multisection
validity studies, faculty self-evaluations of their own
teaching effectiveness, and, perhaps, the observations of
trained observers on specific processes such as teacher
clarity. This provides support for the construct validity of
the ratings.

Peer ratings, based upon classroom visitation, and
research productivity were shown to have little correlation
with students' evaluations, and since they are also
relatively uncorrelated with other indicators of effective
teaching, their validity as measures of effective teaching
is problematic.
Multitrait-Multimethod Matrix: Correlations Between Student Ratings and
Faculty Self-Evaluations in 329 Courses (Reprinted with permission from
Marsh, 1984b).

Columns 1-9: instructor self-evaluation factors. Columns 10-18: student evaluation factors.

                            1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16   17   18
Instructor self-evaluations
 1. Learning/Value       (83)
 2. Enthusiasm             29 (82)
 3. Organization           12   01 (74)
 4. Group Interaction      01   03  -16 (90)
 5. Individual Rapport    -07  -01   07   02 (82)
 6. Breadth                13   12   13   11  -01 (84)
 7. Examinations          -01   08   26   09   16   20 (76)
 8. Assignments            24  -01   17   05   22   09   22 (70)
 9. Workload/Difficulty    03  -01   12  -09   06  -04   09   21 (70)
Student evaluations
10. Learning/Value         46   10  -01   08  -12   09  -04   08   02 (95)
11. Enthusiasm              ?   54  -04  -01  -02  -01  -03  -09  -09   45 (96)
12. Organization           17   13   30  -03   04   07   09   00  -06   62   49 (93)
13. Group Interaction      19   06    ?   62   00  -02  -14  -04  -08   37   30   21 (98)
14. Individual Rapport     03   03  -06   13   28  -19  -03  -02   00   22   35   33   42 (96)
15. Breadth                26   16   09   00  -14   42   00   09   02   49   34   66   17   16 (94)
16. Examinations           18   09   01  -01   06    ?   17  -02  -06   48   42   67   34   60   33 (93)
17. Assignments            20   03   02   09  -01   04  -01   46   12   62   21   34   30   29   40   42 (92)
18. Workload/Difficulty   -06  -03   04   00   03  -03   12   22   69   06   02  -05  -05   08   18  -02   20 (87)

Note. Values in the diagonals of the upper left and lower right matrices,
the two triangular matrices, are reliability (coefficient alpha)
coefficients. The values in the diagonal of the lower left matrix, the
square matrix, are convergent validity coefficients that have been
corrected for unreliability according to the Spearman-Brown equation. The
nine uncorrected validity coefficients, starting with Learning/Value,
[illegible]. All correlations are presented without decimal points.
Correlations greater than .10 are [remainder illegible]. Entries marked ?
are illegible in the source.
Chapter 5. Relationship to Background Characteristics:
Potential Biases in Students' Evaluations

The construct validity of students' evaluations requires them
to be:

1) substantially correlated with indicators of effective
teaching (i.e., they are valid);

2) but relatively uncorrelated with variables that are not
(i.e., they are not biased).

My research indicates that student ratings are not
substantially influenced by potential biases, but that 
faculty still believe that they are. 

In a survey I conducted at the university where SEEQ was
developed, faculty indicated that student ratings were useful
and that teaching quality should be given more emphasis in
personnel decisions. Nevertheless they felt that student
ratings were biased and that other measures of teaching
effectiveness were even more biased.

***A dilemma existed in that faculty wanted teaching to be
evaluated, but were dubious about any procedure to accomplish 
this purpose. 

How Much Do Potential Biases Affect Students' Evaluations?

In several large studies the combined effect of a large
number of potential biases was able to explain a total of
between 3% and 20% of the variance in student ratings.

Student ratings were positively correlated with Prior Subject
Interest, Expected Grades, and Workload/Difficulty, and
specific components of the ratings (e.g., Group Interaction
and Individual Rapport) were negatively correlated with class
size.

The size of the influence of background characteristics is
not huge, but large enough to be worrisome IF THESE
RELATIONSHIPS REALLY REPRESENT BIASES TO STUDENT RATINGS.
However, a more detailed examination of the effects suggests
that the relations represent the influence of variables that
really do affect teaching effectiveness in a way that is
validly reflected in the student ratings.
WORKLOAD/DIFFICULTY EFFECT. Paradoxically, at least based
upon the supposition that this is a potential
"bias" to student ratings, higher levels of
Workload/Difficulty were positively correlated with student
ratings.

***Since the direction of the Workload/Difficulty effect is
opposite to that predicted as a potential bias effect,
Workload/Difficulty does not appear to constitute a bias to
student ratings.



CLASS SIZE EFFECT. Class size is negatively correlated with
student ratings of Group Interaction and Individual Rapport
but not with other SEEQ factors. Similarly, class size is
negatively correlated with instructor self-evaluations for
these two factors but not other SEEQ factors.

***The findings argue that class size does have a moderate
effect on these two aspects of effective teaching and these
effects are accurately reflected in the student ratings.



PRIOR SUBJECT INTEREST EFFECT. The effect of Prior Subject
Interest on SEEQ scores was greater than that of any of the
13 other background variables that I considered. For both
student ratings and instructor self-evaluations, Prior
Subject Interest was most highly correlated with
Learning/Value.

***Again the findings suggest that Prior Subject Interest is
a variable which influences some aspects of effective
teaching (particularly Learning/Value) and these effects are
accurately reflected in both the student ratings and
instructor self-evaluations. Higher student interest in the
subject apparently creates a more favorable learning
environment and facilitates effective teaching, and this
effect is reflected in student ratings as well as instructor
self-evaluations.
Expected Grades. Class-average expected grades are positively
correlated with student ratings. There are, however, three
quite different explanations for this finding:

1) The "grading leniency hypothesis" proposes that
instructors who give higher-than-deserved grades will receive
higher-than-deserved student ratings, and represents a
serious bias.

2) The "validity hypothesis" proposes that better Expected
Grades reflect better student learning, and that a positive
correlation between student learning and student ratings
supports the validity of student ratings.

3) A "student characteristics hypothesis" proposes that
pre-existing student characteristics may affect student
learning, student grades, and teaching effectiveness, so that
the expected grade effect can be explained in terms of other
variables.

While these explanations of the expected grade effect
have quite different implications, they are not mutually
exclusive. The grade a student receives is likely to be
related to the grading leniency of the teacher, how much
he/she learned, and characteristics that he/she brought into
the course. Not surprisingly there is some support for each
explanation.

***It is possible that a grading leniency effect may produce
a bias in student ratings, but support for this suggestion is
weak and the size of such an effect would be small.



Table 5.2

Path Analysis Model Relating Prior Subject Interest, Reason for Taking Course, Expected Grade and Workload/
Difficulty to Student Ratings (Reprinted with permission from Marsh, 1984b)

                        I. Prior Subject     II. Reason (General  III. Expected        IV. Workload/
                        Interest             Interest Only)       Course Grade         Difficulty
Student ratings            DC     TC   Orig     DC     TC   Orig     DC     TC   Orig     DC     TC   Orig
Learning/Value             36     44     44     16     13     16     26     20     29     17     17     12
Enthusiasm                 17     23     23     09     08     09     20     16     20     11     11     06
Organization              -04    -04    -03     16     16     16     03     02     01     04     04     00
Group Interaction          21     28     29     06     06     07     30     27     31     06     06    -02
Individual Rapport        -06     09     09    -01    -02    -02     18     16     17     06     06     01
Breadth                   -07    -03    -03     23     19     19     06    -01    -02     21     21     16
Exams/Grading             -06     03     03     12     10     10     26     18     18     20     20     10
Assignments                11     19     20     21     17     18     19     09     13     30     30     23
Overall course             23     32     33     19     16     16     26     15     22     30     30     23
Overall instructor         12     20     20     13     11     12     24     17     20     17     17     10
Variance components*     2.9%   6.1%   6.3%   2.3%   1.6%   1.8%   4.6%   2.6%   4.0%   3.6%   3.6%   1.8%

* Calculated by summing the squared coefficients, dividing by the number of coefficients, and multiplying by
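
The variance-components footnote can be checked with a short calculation. The footnote is cut off in the source, so the final multiplier is an assumption: the sketch below treats the printed coefficients as decimals (they appear without decimal points) and multiplies the mean squared coefficient by 100 to obtain a percentage. The function name is hypothetical. With that assumption, the ten DC coefficients printed for Expected Course Grade give a value that rounds to the 4.6% printed in the table.

```python
def variance_component(coefficients_x100):
    """Mean squared path coefficient, expressed as a percentage.

    Coefficients are given as printed in the table, i.e. multiplied by 100.
    The final multiplication by 100 is an assumption (the footnote is truncated).
    """
    if not coefficients_x100:
        raise ValueError("need at least one coefficient")
    decimals = [c / 100 for c in coefficients_x100]
    return 100 * sum(d * d for d in decimals) / len(decimals)


# DC column printed for III. Expected Course Grade, one value per rating row.
expected_grade_dc = [26, 20, 3, 30, 18, 6, 26, 19, 26, 24]

component = variance_component(expected_grade_dc)
print(f"variance component: {component:.1f}%")
```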



[Figure 5.1: path diagram linking Prior Subject Interest, Reason for
Taking Course (General Interest), Expected Grade, and Workload/Difficulty
to Student Ratings. Legible path coefficients: p = +0.21 and p = +0.20;
the remaining coefficients are illegible.]

Figure 5.1 Path analysis model relating prior subject interest, reason for taking course, expected grade, and
Workload/Difficulty (Path coefficients for the student rating factors appear in Table 5.2; reprinted with
permission from Marsh, 1984b)
Chapter 6. "Dr. Fox" Studies

The Dr. Fox effect is defined as the overriding influence of
instructor expressiveness on students' evaluations. In the
original Dr. Fox study, a professional actor was favorably
evaluated when he lectured in an enthusiastic/expressive
manner, even though he presented material of little
educational value. The authors of the study and critics agree
that it had serious methodological problems.

To overcome some problems Ware and Williams developed the
standard Dr. Fox paradigm in which a series of experimentally
manipulated lectures were videotaped. Lectures varied in the
content coverage and the expressiveness of delivery. Students
viewed one lecture, evaluated teaching effectiveness, and
completed an achievement test based on all the material in
the high-content lecture. Expressiveness affected student
ratings more than did content, whereas content affected
achievement test scores more than expressiveness (see
meta-analysis by Abrami, et al., 1982).

A Reanalysis.

Marsh and Ware (1982) reanalyzed data from the Ware and
Williams studies. A factor analysis of the rating instrument
identified five factors which varied in the way they were
affected by the manipulations. In the condition most like the
university classroom (students knew about the test and a
reward prior to viewing the lecture) THE DR. FOX EFFECT WAS
NOT FOUND. The instructor expressiveness manipulation only
affected ratings of Instructor Enthusiasm, the factor most
logically related to that manipulation. Content coverage
significantly affected ratings of Instructor Knowledge and
Organization/Clarity, factors most logically related to that
manipulation.

When students had no incentive to perform and did not know
they would be tested, instructor expressiveness had a much
larger effect on all five student rating factors. In this
condition, however, expressiveness also had a larger impact
on test scores than the content manipulation. This is one of
the few studies to demonstrate that instructor expressiveness
causes better examination performance.

How Should the Dr. Fox Effect Be Interpreted? 

These results are frequently used to argue for the invalidity
of student ratings, but my interpretation is quite different.
Using a construct validity approach, a specific rating factor
should be substantially influenced by manipulations most
logically related to it and less influenced by other
manipulations. This interpretation offers strong support to
the validity of student ratings with respect to instructor
expressiveness and limited support to their validity with
respect to content.



Table 6.1

Effect Sizes of Expressiveness, Content, and Expressiveness x Content Interaction in Each of the Three Incentive
Conditions (Reprinted with permission from Marsh, 1984b)

Condition                          Expressiveness (%)   Content (%)   Interaction (%)



No External Incentive 
Clarity/Organization 
Instructor Concern 
Instructor Knowledge 
Instructor Enthusiasm 
Learning Stimulation 
Total rating (across all items) 
Achievement test scores 

Incentive After Lecture 
Clarity/Organization 
Instructor Concern 
Instructor Knowledge 
Instructor Enthusiasm 
Learning Stimulation 
Total rating (across all items) 
Achievement test scores 

Incentive Before Lecture 
Clarity/Organization 
Instructor Concern 
Instructor Knowledge 
Instructor Enthusiasm 
Learning Stimulation 
Total rating (across all items) 
Achievement test scores 

Across All Incentive Conditions 
Clarity/Organization 
Instructor Concern 
Instructor Knowledge 
Instructor Enthusiasm 
Learning Stimulation 
Total rating (across all items) 
Achievement test scores 



[The numeric entries of Table 6.1 did not survive in tabular form, and their cell positions cannot be recovered.
Legible entries, in order of appearance (four illegible entries omitted): 11.3**, 13.0**, 9.6**, 1.5, 25.4**,
3.3**, 9.4**, 5.2**, 1.3, 2.0, 26.5**, 2.1**, 5.0**, 1.6*, 1.0, 6.4**, .8, 25.4**, 1.2*, 3.3**, 4.9**, 1.1,
12.5**, 5.2**, .3]



Note. Separate analyses of variance (ANOVAs) were performed for each of the five evaluation factors, the sum
of the 18 rating items (Total rating), and the achievement test. First, separate two-way ANOVAs (Expressiveness
x Content) were performed for each of the three incentive conditions, and then three-way ANOVAs (Incentive
x Expressiveness x Content) were performed for all the data. The effect sizes were defined as (SS effect/SS total)
x 100%.

*p < .05. **p < .01.
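
The effect-size definition in the note to Table 6.1 (sum of squares for an effect divided by the total sum of squares, times 100%) can be illustrated with a minimal Python sketch. It assumes a balanced two-way design with the same number of scores in every cell. The function name `effect_sizes`, the level labels, and the demo scores are hypothetical and are not data from the Ware and Williams studies.

```python
from statistics import mean


def effect_sizes(cells):
    """Return each effect's share of the total sum of squares, in percent.

    `cells` maps (expressiveness_level, content_level) to a list of scores.
    Assumption: the design is balanced (equal cell sizes).
    """
    e_levels = sorted({e for e, _ in cells})
    c_levels = sorted({c for _, c in cells})
    n = len(next(iter(cells.values())))
    if any(len(scores) != n for scores in cells.values()):
        raise ValueError("this sketch only handles balanced designs")

    all_scores = [x for scores in cells.values() for x in scores]
    grand = mean(all_scores)
    ss_total = sum((x - grand) ** 2 for x in all_scores)

    # Marginal means for each level of each factor.
    e_mean = {e: mean([x for c in c_levels for x in cells[(e, c)]]) for e in e_levels}
    c_mean = {c: mean([x for e in e_levels for x in cells[(e, c)]]) for c in c_levels}

    ss_e = n * len(c_levels) * sum((m - grand) ** 2 for m in e_mean.values())
    ss_c = n * len(e_levels) * sum((m - grand) ** 2 for m in c_mean.values())
    # Interaction: what is left of each cell mean after removing both main effects.
    ss_ec = n * sum(
        (mean(cells[(e, c)]) - e_mean[e] - c_mean[c] + grand) ** 2
        for e in e_levels
        for c in c_levels
    )

    def as_percent(ss):
        return 100.0 * ss / ss_total

    return {
        "expressiveness": as_percent(ss_e),
        "content": as_percent(ss_c),
        "interaction": as_percent(ss_ec),
    }


# Hypothetical ratings: 2 expressiveness levels x 2 content levels, 2 scores per cell.
demo_cells = {
    ("low", "low"): [2.0, 4.0],
    ("low", "high"): [3.0, 5.0],
    ("high", "low"): [6.0, 8.0],
    ("high", "high"): [7.0, 9.0],
}

for effect, percent in effect_sizes(demo_cells).items():
    print(f"{effect}: {percent:.1f}%")
```

Because each effect is expressed as a share of the same total, the three percentages and the within-cell (error) share add to 100% in a balanced design.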
Chapter 7: Utility of Student Ratings

Improvement of Instruction.

The introduction of a broad institution-based, carefully
planned program of students' evaluations of teaching
effectiveness is likely to lead to the improvement of
teaching because:

1) faculty will have to give serious consideration to
their own teaching in order to evaluate the merits of the
program;

2) the institution of a program which is supported by the
administration will serve notice that teaching effectiveness
is being taken more seriously by the administrative
hierarchy;

3) the results of student ratings, as one indicator of
effective teaching, will provide a basis for informed
administrative decisions and thereby increase the likelihood
that quality teaching will be recognized and rewarded, and
that good teachers will be kept;

4) the social reinforcement of getting favorable ratings
will provide added incentive for the improvement of teaching,
even for tenured faculty;

5) faculty report that the feedback from students'
evaluations is useful to their own efforts for the
improvement of their teaching.

None of these observations, however, provides an
empirical demonstration of improvement of teaching
effectiveness resulting from students' evaluations.

Feedback Studies.

In most studies of the effects of feedback from students'
evaluations:



1) classes are randomly assigned to experimental or 
control groups; 

2) students' evaluations are collected near the middle of 
the term; 

3) at least the ratings from one or more groups are 
returned to instructors as quickly as possible; 

4) the various groups are compared at the end of the term
on a second administration of student ratings as well as on
other variables.
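
The design steps listed above can be sketched in Python as follows. This is an illustration under stated assumptions, not the analysis used in any particular feedback study: the function names `assign_groups` and `f_statistic`, the class identifiers, and the end-of-term scores are all hypothetical, and the comparison shown is a plain one-way F test for two independent groups.

```python
import random
from statistics import mean


def assign_groups(class_ids, seed=0):
    """Step 1: randomly split classes into a feedback half and a control half."""
    rng = random.Random(seed)
    shuffled = list(class_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


def f_statistic(group_a, group_b):
    """Step 4: one-way F for two groups; returns (F, df_between, df_within)."""
    groups = (list(group_a), list(group_b))
    scores = groups[0] + groups[1]
    grand = mean(scores)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_within = len(scores) - len(groups)
    return ss_between / (ss_within / df_within), len(groups) - 1, df_within


# Hypothetical class identifiers and end-of-term class-average ratings.
feedback_classes, control_classes = assign_groups(range(1, 7), seed=1)
end_of_term_feedback = [4.0, 5.0, 6.0]
end_of_term_control = [2.0, 3.0, 4.0]

f_value, df1, df2 = f_statistic(end_of_term_feedback, end_of_term_control)
print(f"feedback classes: {feedback_classes}, control classes: {control_classes}")
print(f"F({df1}, {df2}) = {f_value:.2f}")
```

The same comparison can be repeated for each end-of-term outcome (rating factors, examination scores, affective outcomes), which is how a table of F values such as Table 9 would be built up.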

In FEEDBACK studies using SEEQ in multiple sections of the
same courses:

Study 1. Results from an abbreviated form of the survey
were simply returned to faculty, and the impact of the
feedback was positive, but very modest.

Study 2. Here I actually met with instructors in the
feedback group to discuss the evaluations and possible
strategies for improvement. In this study students in the
feedback group subsequently performed better on a
standardized final examination, rated teaching effectiveness
more favorably at the end of the course, and experienced more
favorable affective outcomes at the end of the course (i.e.,
feelings of course mastery, and plans to pursue and/or apply
the subject).

These two studies suggest that feedback, coupled with a
candid discussion with an external consultant, can be an
effective intervention for the improvement of teaching
effectiveness.

Remaining Issues

Several issues still remain for FEEDBACK research. 

1) How much of the observed effect is due to
consultation that does not depend on feedback from student
ratings?

2) Nearly all of the feedback studies were based on
feedback from midterm ratings. This limitation, perhaps,
weakens the likely effects, in that many instructional
characteristics cannot be easily altered in the second half
of the course. Further study is also needed of whether these
findings generalize to the more typical situation, in which
end-of-term ratings from one term are fed back to inform
subsequent teaching.

3) Reward structure is an important variable which has not
been examined in this feedback research. Even though faculty
may be intrinsically motivated to improve their teaching
effectiveness, potentially valuable feedback will be much
less useful if there is no extrinsic motivation for faculty
to improve. To the extent that salary, promotion, and
prestige are based almost exclusively on research
productivity, the usefulness of student ratings as feedback
for the improvement of teaching may be limited.

4) There has been too little systematic research on the
usefulness of students' evaluations for the other purposes
for which they are intended: personnel decisions, student
instructor/course selection, and research on teaching.



Table 9 

F Values for Differences Between Students With Either Feedback or No- 
Feedback Instructors For End-of-Term Ratings, Final Exam Performance, and 
Affective Course Consequences (Reprinted with permission from Overall and 
Marsh, 1979; see original article for more details of the analysis). 



                                         Feedback        No feedback
Variable                                 M       SD      M       SD     Difference   F(1, 601)

Rating components
  Concern                                52.38    8.5    49.51   10.1      2.87       19.1**
  Breadth                                50.84    7.9    49.59    7.9      1.25        4.8*
  Interaction                            51.94    7.4    48.61   10.3      3.33       32.4**
  Organization                           49.88    9.4    50.88    9.6     -1.00        2.5
  Learning/Value                         50.77    9.9    48.22   10.7      2.56       11.7**
  Exams/Grading                          50.52    9.9    49.08   10.1      1.44        4.1*
  Workload/Difficulty                    51.13    8.8    51.51    8.8      -.38         .4
  Overall Instructor                      7.00    1.6     6.33    2.1       .67       26.4**
  Overall Course                          5.81    1.8     5.39    2.0       .42        5.4*
  Instructional Improvement               5.97    1.6     5.49    1.5       .48       16.0**
Final exam performance                   51.34    9.9    49.41   10.1
Affective course consequences
  Programming competence achieved
  Computer understanding gained
  Future computer use planned
  Future computer application planned
  Further related coursework planned
Chapter 8: The Use of Student Ratings in Different Countries

Students' evaluations are collected in most North American
universities, but not in other parts of the world and not in
secondary institutions. The Applicability Paradigm is
designed to test the applicability of two rating instruments,
my SEEQ and Peter Frey's Endeavor, to other countries. A
representative sample of students is asked to:

a) select a "best" and a "worst" teacher,

b) rate each using SEEQ and Endeavor, 

c) indicate inappropriate items, and 

d) select the most important items 



Analyses of the results included: 

a) a discrimination of "best" and "worst" teachers,

b) comparisons of "inappropriate" and "most important" items,

c) factor analyses of SEEQ and Endeavor responses, and

d) multitrait-multimethod (MTMM) analyses of relations between
SEEQ and Endeavor scales
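
The logic of the multitrait-multimethod comparison in the last step can be sketched in Python. This is a simplified illustration, not the analysis reported in these studies: the trait labels ("rapport", "workload"), the pairing of scales across the two instruments, and the scores are hypothetical. Convergent validity is indicated when the same trait measured by the two instruments correlates highly; discriminant validity is indicated when those correlations exceed the correlations between different traits.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def mtmm_summary(method_1, method_2):
    """Compare same-trait and different-trait correlations across two methods.

    Each argument maps a trait name to the scores that the same teachers
    received on that trait from one instrument.
    """
    convergent = {t: pearson(method_1[t], method_2[t]) for t in method_1}
    discriminant = {
        (t, u): pearson(method_1[t], method_2[u])
        for t in method_1
        for u in method_2
        if t != u
    }
    return convergent, discriminant


# Hypothetical scale scores for four teachers on two matched traits.
seeq_scores = {"rapport": [1.0, 2.0, 3.0, 4.0], "workload": [2.0, 1.0, 4.0, 3.0]}
endeavor_scores = {"rapport": [2.0, 4.0, 6.0, 8.0], "workload": [1.0, 0.0, 3.0, 2.0]}

convergent, discriminant = mtmm_summary(seeq_scores, endeavor_scores)
print("same trait, different instrument:", convergent)
print("different trait, different instrument:", discriminant)
```

A full MTMM analysis also examines correlations among different traits within the same instrument; that block is omitted here to keep the sketch short.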



The applicability paradigm has been used in Spain, Papua New
Guinea, New Zealand, Indonesia, and two different tertiary
settings in Australia. In each study most items were judged
to be appropriate and chosen by at least some as most
important, and all but Workload/Difficulty items
differentiated between good and poor teachers. There was a
surprising consistency in the items chosen as most important
and inappropriate across the studies. Factor analyses
identified most of the factors the instruments were designed
to measure. The MTMM analyses provided support for both the
convergent and discriminant validity of the responses to the
two instruments. The studies suggest that students in
different countries do differentiate among different
components of effective teaching in a way similar to North
American students when responding to SEEQ and Endeavor.

Based on these studies, the Applicability Paradigm appears to
provide a useful initial study in evaluating the
applicability of students' evaluations of teaching
effectiveness in a new setting.
Research reviewed shows that student ratings are:

1) multidimensional; 

2) reliable and stable; 

3) primarily a function of the instructor who
teaches a course rather than of the course that is
taught;

4) relatively valid against a variety of indicators 
of effective teaching; 

5) relatively unaffected by a variety of variables 
hypothesized as potential biases; 

6) seen to be useful by faculty as feedback about
their teaching, by students for course selection, and
by administrators for use in personnel decisions.



However, the same findings also demonstrate that
student ratings have some faults, and they are viewed
with some skepticism by faculty as a basis for
personnel decisions.

This level of uncertainty probably also exists for all
personnel evaluations, particularly among those
being evaluated. Students' evaluations of teaching
effectiveness are probably the most thoroughly studied
form of personnel evaluation, and one of the best in
terms of being supported by empirical research.



Alternative Indicators of Effective Teaching

Despite the generally supportive research
findings, student ratings should be used cautiously.
There should be other forms of systematic input about
teaching effectiveness, particularly for personnel
decisions.

Whereas there is good evidence to support the use
of students' evaluations as one indicator of effective
teaching, there are few other indicators of teaching
effectiveness whose use is systematically supported by
research findings.

Extensive lists of alternative indicators of effective
teaching are proposed, but few are supported by
systematic research, and none are as clearly supported
as students' evaluations of teaching.



